NSF PAR Search | NSF Public Access Repository


On Classification with Large Language Models in Cultural Analytics

Bamman, David; Chang, Kent; Lucy, Li; Zhou, Naitian (December 2024, Fifth Conference on Computational Humanities Research)

In this work, we survey the way in which classification is used as a sensemaking practice in cultural analytics, and assess where large language models can fit into this landscape. We identify ten tasks supported by publicly available datasets on which we empirically assess the performance of LLMs compared to traditional supervised methods, and explore the ways in which LLMs can be employed for sensemaking goals beyond mere accuracy. We find that prompt-based LLMs are competitive with traditional supervised models for established tasks, but perform less well on de novo tasks. In addition, LLMs can assist sensemaking by acting as an intermediary input to formal theory testing.
Full Text Available
AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters

Lucy, Li; Gururangan, Suchin; Soldaini, Luca; Strubell, Emma; Bamman, David; Klein, Lauren; Dodge, Jesse (August 2024, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL))

Large language models’ (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation. However, decisions around what data is retained or removed during this initial stage are underscrutinized. In our work, we ground web text, which is a popular pretraining data source, to its social and geographic contexts. We create a new dataset of 10.3 million self-descriptions of website creators, and extract information about who they are and where they are from: their topical interests, social roles, and geographic affiliations. Then, we conduct the first study investigating how ten “quality” and English language identification (langID) filters affect webpages that vary along these social dimensions. Our experiments illuminate a range of implicit preferences in data curation: we show that some quality classifiers act like topical domain filters, and langID can overlook English content from some regions of the world. Overall, we hope that our work will encourage a new line of research on pretraining data curation practices and their social implications.
Full Text Available
Racial and Ethnic Representation in Literature Taught in US High Schools

https://doi.org/10.22148/001c.131682

Lucy, Li; Griffiths, Camilla; Ying, Claire; Kim-Ebio, JJ; Baur, Sabrina; Levine, Sarah; Eberhardt, Jennifer L; Bamman, David; Demszky, Dorottya (January 2025, Journal of Cultural Analytics)

We quantify the representation, or presence, of characters of color in English Language Arts (ELA) instruction in the United States to better understand possible racial/ethnic emphases and gaps in literary curricula. We contribute two datasets: the first consists of books listed in widely-adopted Advanced Placement (AP) Literature & Composition exams, and the second is a set of books taught by teachers surveyed from schools with substantial Black and Hispanic student populations. In addition to these book lists, we provide an unprecedented collection of hand-annotated sociodemographic labels of not only literary authors, but also their characters. We use computational methods to measure all main characters’ presence through three distinct and nuanced metrics: frequency, narrative perspective, and burstiness. Our annotations and measurements show that the sociodemographic composition of characters in books recommended by AP Literature has not shifted much for over twenty years. As a case study of how ELA curricula may deviate from the curricula prescribed by AP, our teacher-provided sample shows a greater emphasis on books featuring first-person, primary characters of color. We also find that only a few books in either dataset feature both White main characters and main characters of color. Arguably, these books may uphold a view of racial/ethnic segregation as a societal norm.
Full Text Available
Discovering Differences in the Representation of People using Contextualized Semantic Axes

Lucy, Li; Tadimeti, Divya; Bamman, David (January 2022, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing)

A common paradigm for identifying semantic differences across social and temporal contexts is the use of static word embeddings and their distances. In particular, past work has compared embeddings against “semantic axes” that represent two opposing concepts. We extend this paradigm to BERT embeddings, and construct contextualized axes that mitigate the pitfall where antonyms have neighboring representations. We validate and demonstrate these axes on two people-centric datasets: occupations from Wikipedia, and multi-platform discussions in extremist, men’s communities over fourteen years. In both studies, contextualized semantic axes can characterize differences among instances of the same word type. In the latter study, we show that references to women and the contexts around them have become more detestable over time.
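The static "semantic axis" paradigm that this abstract describes as past work can be illustrated with a minimal sketch. It assumes the common construction in which an axis is the difference between the mean embeddings of two opposing word sets, and a word is scored by its cosine similarity to that axis. The three-dimensional vectors and the word lists below are made-up toy values, not data from the paper, and the sketch does not reproduce the paper's contextualized BERT-based axes.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def semantic_axis(pole_a, pole_b):
    """Axis pointing from the mean of pole_b toward the mean of pole_a."""
    a = mean_vector(pole_a)
    b = mean_vector(pole_b)
    return [x - y for x, y in zip(a, b)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

# Hypothetical toy embeddings (assumption: real work would use trained vectors).
embeddings = {
    "good": [1.0, 0.2, 0.0],
    "great": [0.9, 0.1, 0.1],
    "bad": [-1.0, 0.1, 0.0],
    "awful": [-0.8, 0.3, 0.1],
    "kind": [0.7, 0.0, 0.2],
    "cruel": [-0.6, 0.1, 0.3],
}

# Build a positive-vs-negative axis from two small sets of pole words.
axis = semantic_axis(
    [embeddings["good"], embeddings["great"]],
    [embeddings["bad"], embeddings["awful"]],
)

# Positive scores lean toward the first pole, negative toward the second.
score_kind = cosine(embeddings["kind"], axis)
score_cruel = cosine(embeddings["cruel"], axis)
print(f"kind: {score_kind:.3f}, cruel: {score_cruel:.3f}")
```

With static embeddings each word type has one vector, so every occurrence of a word receives the same score; the abstract's contribution is to build such axes from contextualized embeddings so that individual instances of a word can be compared.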
Full Text Available
Gender and Representation Bias in GPT-3 Generated Stories

Lucy, Li; Bamman, David (January 2021, Proceedings of the Third Workshop on Narrative Understanding)
Using topic modeling and lexicon-based word similarity, we find that stories generated by GPT-3 exhibit many known gender stereotypes. Generated stories depict different topics and descriptions depending on GPT-3’s perceived gender of the character in a prompt, with feminine characters more likely to be associated with family and appearance, and described as less powerful than masculine characters, even when associated with high power verbs in a prompt. Our study raises questions on how one can avoid unintended social biases when using large language models for storytelling.
Full Text Available
Characterizing English Variation across Social Media Communities with BERT

Lucy, Li; Bamman, David (January 2021, Transactions of the Association for Computational Linguistics)
Daelemans, Walter (Ed.)
Much previous work characterizing language variation across Internet social groups has focused on the types of words used by these groups. We extend this type of study by employing BERT to characterize variation in the senses of words as well, analyzing two months of English comments in 474 Reddit communities. The specificity of different sense clusters to a community, combined with the specificity of a community’s unique word types, is used to identify cases where a social group’s language deviates from the norm. We validate our metrics using user-created glossaries and draw on sociolinguistic theories to connect language variation with trends in community behavior. We find that communities with highly distinctive language are medium-sized, and their loyal and highly engaged users interact in dense networks.
Full Text Available
